This document is a formal write-up for the final project of BST 270 - Reproducible Data Science, a biostatistics course taught by Dr. Viola Fanfani at the Harvard T.H. Chan School of Public Health. The objective of the project is to replicate a set of figures that were published by the New York Times during the COVID-19 pandemic. In addition to critiquing the reproducibility of the Times’ data visualizations, we have created a publicly available GitHub repository so that our own work can be easily understood and reproduced by anyone interested. Without further ado, let us proceed to the first of our five required objectives.
Our first task is to reproduce the following figure. The graph shows a time series of the rolling 7-day average of daily COVID cases nationwide. Because the Times does not state explicitly what kind of average this is (i.e., weighted vs. unweighted, trailing vs. centered), we will assume that each point represents the average number of cases per day in a week-long window centered on the day being measured.
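The distinction matters: a centered 7-day average for day t uses days t−3 through t+3, while a trailing (right-aligned) average uses days t−6 through t. The following toy sketch illustrates the difference using only base R's stats::filter (the analysis itself uses zoo::rollmean, whose default is align = "center"):

```r
# Toy illustration of centered vs. trailing 7-day averages in base R.
x <- 1:14  # stand-in for a daily case series

# sides = 2: window centered on each day; sides = 1: window of the past 7 days
centered <- as.numeric(stats::filter(x, rep(1/7, 7), sides = 2))
trailing <- as.numeric(stats::filter(x, rep(1/7, 7), sides = 1))

centered[4]  # mean of x[1:7] = 4, centered on day 4
trailing[7]  # mean of x[1:7] = 4, trailing up to day 7
```

Note that the two conventions shift the curve by three days relative to each other, which can matter when comparing sharp peaks against the Times' figure.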
The first step in reproducing this figure is going to be to load the
necessary packages and scripts. We use tidyverse for data
cleaning and visualization, zoo for computing rolling
averages, bigreadr for reading large datasets to data
frames efficiently, dtplyr for speeding up
dplyr code with data.table, which is also
loaded, and usmap and maps for some of our
later plots. We also source two files containing helper functions that
will come in handy for wrangling the data.
# Load required packages and scripts
library(tidyverse) # Data wrangling / plotting
library(zoo) # Compute rolling averages
library(bigreadr) # Read large CSVs to data frames
library(data.table) # Needed for dtplyr
library(dtplyr) # Speed up dplyr code with data.table backend
library(usmap) # Used for plotting maps
library(maps) # Use for plotting maps
# Load external functions
source("helper_scripts/get_count_pop_data.R")
source("helper_scripts/get_county_covid_data.R")
The second step is to load data from the New York Times using the
specified link, compute rolling averages using the rollmean
function, and simplify the dataset to include only the two variables
that we need. We also subset the data to specify the desired dates,
which appeared to be, approximately, January 24th, 2020 to January 5th,
2022.
# Load NYT COVID data
covid <-
fread("https://raw.githubusercontent.com/nytimes/covid-19-data/master/rolling-averages/us.csv") |>
lazy_dt() #enables data.table backend
# Compute rolling average of new cases
avg_cases <-
covid |>
# Compute rolling average
transmute(
day = as.Date(date),
new_cases_av = rollmean(cases, k = 7, fill = NA) # centered rolling avg
) |>
# Select needed dates
filter(
day <= as.Date("2022-01-05"),
day >= as.Date("2020-01-24")
)
head(as.data.frame(avg_cases))
## day new_cases_av
## 1 2020-01-24 0.7142857
## 2 2020-01-25 0.5714286
## 3 2020-01-26 0.5714286
## 4 2020-01-27 0.7142857
## 5 2020-01-28 0.7142857
## 6 2020-01-29 0.7142857
Finally, we are able to make the plot with ggplot. Our
figure looks very close to the one presented by the New York Times,
though perhaps a little less smooth. However, since we do not know
exactly how they computed their rolling averages, this appears to be the
best we can do. Overall, the figure is highly reproducible, aside from a
few minor discrepancies between June and October of 2021.
# Plot cases
avg_cases |>
as_tibble() |>
ggplot(aes(x = day, y = new_cases_av)) +
geom_line(color = "red") +
geom_area(fill = "pink", alpha = 0.5) +
scale_y_continuous(breaks = c(2e5,4e5,6e5),
labels = c("200,000","400,000","600,000"))+
scale_x_date(breaks = as.Date(c("2020-02-01","2020-06-01","2020-10-01",
"2021-02-01","2021-06-01","2021-10-01")),
labels = c("Feb. 2020", "Jun.", "Oct.", "Feb. 2021", "Jun.","Oct.")
) +
theme_minimal() +
theme(
axis.title = element_blank(),
axis.ticks.x = element_line(color = "gray30"),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.major.y = element_line(color = "gray80", linetype = 2),
axis.line.x = element_line(linewidth = 1, color = "black")
)
Task two is to recreate the following table, focusing on cases and deaths.
The first step in doing so is
to compute the daily average on December 29th, 2021 so that it can be
contrasted with January 12th, 2021 and we can compute a 14-day change.
Unfortunately, after plenty of experimentation, we were unable to match
their exact numbers because we could not determine what sort of “daily
average” they used. We settled on taking the average of the previous 7
days, which differs from the centered average used in Task 1 but
appears to be what the authors used in Tasks 3 and 4 (below). Computing
this average involves the following code:
covid$date <- as.Date(covid$date)
# Compute average cases and deaths 12/29/21 (needed for pct change)
avg_12_29 <-
covid |>
filter(
date <= as.Date("2021-12-29"),
date >= as.Date("2021-12-23")
) |>
summarize(
cases = mean(cases),
deaths = mean(deaths)
) |>
as.data.frame()
avg_12_29
## cases deaths
## 1 301502.6 1543.429
Now that we have the average for December 29th, we can compute the average for January 12th and the percent change. The results are shown below.
avg_12_29 <- as.vector(t(avg_12_29))
# Compute average cases and deaths 01/12/22 and pct change
covid |>
filter(
date <= as.Date("2022-01-12"),
date >= as.Date("2022-01-06")
) |>
summarize(cases = mean(cases), deaths = mean(deaths)) |>
pivot_longer(cols = everything(), values_to = "avg", names_to = "Variable") |>
mutate(change = paste0(round(100 * (avg - avg_12_29) / avg_12_29), "%")) |>
rename(`Daily Avg. Jan. 12` = avg, `14-Day Change` = change) |>
as.data.frame()
## Variable Daily Avg. Jan. 12 14-Day Change
## 1 cases 786550.000 161%
## 2 deaths 1838.857 19%
The number of COVID-19 cases closely resembles the number presented by the New York Times (781,203), as does the percent change (159%). The estimated number of deaths also appears to match, but our percent change (19%) is far lower than theirs (51%). We cannot determine how the New York Times arrived at that number, and the table’s reproducibility is weakened by the fact that we do not know what sort of average they used. In any case, this table is clearly less intuitive to decipher and harder to replicate than the time series figure.
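As a quick arithmetic check, the 14-day changes in our table follow directly from the two sets of 7-day averages computed above:

```r
# Recompute the 14-day percent changes from the averages reported above.
pct_change <- function(new, old) round(100 * (new - old) / old)

pct_change(786550.000, 301502.6)  # cases:  161 (%)
pct_change(1838.857, 1543.429)    # deaths: 19 (%)
```

This confirms that our 19% figure for deaths is internally consistent with our averages; the discrepancy with the Times' 51% must therefore come from a different choice of averaging window or reference date on their end.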
Next, our third task is to reproduce this geographic illustration of nationwide COVID rates, in which each US county is colored according to its average daily COVID cases per 100,000 people over the past week.
The first step in reproducing this figure is to download county-level COVID data from the New York Times. The code for downloading this data is omitted here for brevity, but can be examined in the file helper_scripts/get_county_covid_data.R.
# Load time-stamped county covid data
county_covid <-
get_county_covid_data() |>
lazy_dt()
head(county_covid)
## Source: local data table [6 x 6]
## Call: head(`_DT3`, n = 6L)
##
## date geoid region subregion cases deaths
## <IDate> <chr> <chr> <chr> <int> <int>
## 1 2022-01-01 USA-72999 puerto rico unknown 0 0
## 2 2022-01-01 USA-72153 puerto rico yauco 0 0
## 3 2022-01-01 USA-72151 puerto rico yabucoa 0 0
## 4 2022-01-01 USA-72149 puerto rico villalba 0 0
## 5 2022-01-01 USA-72147 puerto rico vieques 0 0
## 6 2022-01-01 USA-72145 puerto rico vega baja 0 0
##
## # Use as.data.table()/as.data.frame()/as_tibble() to access results
The second step is to compute daily average COVID rates in the past week on January 12th. We do so by filtering to the appropriate range of days, excluding regions outside the continental United States, and averaging the number of cases over all those days within each state-specific subregion. This produces estimates of the average number of COVID cases per day in the past week.
# regions to omit
exclude_regions <- c("Puerto Rico", "American Samoa", "Northern Mariana Islands",
"Virgin Islands")
# daily average COVID cases in past week
county_covid_daily <-
county_covid |>
mutate(date = as.Date(date)) |>
filter(
date <= as.Date("2022-01-12"),
date >= as.Date("2022-01-06"),
! region %in% exclude_regions
) |>
group_by(region, subregion) |>
summarize(
daily_cases = mean(cases)
) |>
ungroup() |>
as_tibble()
head(county_covid_daily)
## # A tibble: 6 × 3
## region subregion daily_cases
## <chr> <chr> <dbl>
## 1 alabama autauga 100.
## 2 alabama baldwin 434.
## 3 alabama barbour 57.3
## 4 alabama bibb 52.3
## 5 alabama blount 81.6
## 6 alabama bullock 24.3
Finally, in order to convert those case counts to per capita rates, we load county-level population data from the US Census Bureau, then merge it with our current dataset by county. (If you are interested in the code for this, see helper_scripts/get_count_pop_data.R.) We then divide the previous estimate of cases per day by the county population and multiply by one hundred thousand, giving average daily cases per 100,000 people.
# Load county population data with latitude / longitude info
county_pop <- get_county_pop_data() |> lazy_dt()
# Merge and compute average daily cases per person
county_covid_daily_pp <-
county_covid_daily |>
lazy_dt() |>
left_join(y = county_pop, by = c("region", "subregion")) |>
mutate(daily_cases_pp = daily_cases / population * 1e5) |>
select(region, subregion, daily_cases_pp)
head(as_tibble(county_covid_daily_pp))
## # A tibble: 6 × 3
## region subregion daily_cases_pp
## <chr> <chr> <dbl>
## 1 alabama autauga 180.
## 2 alabama baldwin 195.
## 3 alabama barbour 232.
## 4 alabama bibb 233.
## 5 alabama blount 141.
## 6 alabama bullock 240.
Now, in order to convert this data frame to a plot, we must load US
county map coordinates using the map_data() function from
the ggplot2 package. Then we merge this with the
county-level COVID data so that each COVID rate is matched to a latitude
and longitude value.
# Load latitude / longitude data for plotting
county_location <- map_data("county") |> lazy_dt()
plot_data <-
county_location |>
left_join(county_covid_daily_pp, by = c("region", "subregion")) |>
as_tibble()
head(plot_data)
## # A tibble: 6 × 7
## long lat group order region subregion daily_cases_pp
## <dbl> <dbl> <dbl> <int> <chr> <chr> <dbl>
## 1 -86.5 32.3 1 1 alabama autauga 180.
## 2 -86.5 32.4 1 2 alabama autauga 180.
## 3 -86.5 32.4 1 3 alabama autauga 180.
## 4 -86.6 32.4 1 4 alabama autauga 180.
## 5 -86.6 32.4 1 5 alabama autauga 180.
## 6 -86.6 32.4 1 6 alabama autauga 180.
Finally, we are able to make the desired plot. Setting aside the color scheme, our results for the most part look similar to those obtained by the New York Times: there are light patches in Maine, Nevada, Utah, and Montana, and dark patches in Florida, Arizona, New Mexico, and along the Eastern Seaboard. However, our figure reveals a few counties with missing data, which the Times’ figure did not show. Overall, their figure is somewhat reproducible, but it is difficult to generate their exact results because we do not know exactly how their average was computed, and it is difficult to recreate their exact color scale for comparison.
# Plot!
plot_data |>
mutate(daily_cases_pp = pmin(daily_cases_pp, 300)) |>
ggplot(aes(x = long, y = lat, group = group, fill = daily_cases_pp)) +
geom_polygon(color = "white", linewidth = .01 ) +
scale_fill_distiller(palette = 8, direction = 1) +
theme_minimal() +
labs(fill = "Avg. daily cases per 100,000") +
theme(
panel.grid = element_blank(),
axis.text = element_blank(),
legend.position = "top",
axis.title = element_blank()
)
Our last task is to recreate the table below, focusing on COVID-19 cases and deaths.
The first step is to compute the state average daily case rates. This can be done easily with the code below.
# Compute average daily cases
state_daily_cases <-
county_covid |>
# select dates
filter(date <= as.Date("2022-01-12"), date >= as.Date("2022-01-06")) |>
# sum over all counties for each region-date
group_by(region, date) |>
summarize(cases = sum(cases), deaths = sum(deaths), .groups = "drop") |>
# average over all dates
group_by(region) |>
summarize(cases = mean(cases), deaths = mean(deaths))
head(as_tibble(state_daily_cases))
## # A tibble: 6 × 3
## region cases deaths
## <chr> <dbl> <dbl>
## 1 alabama 10339. 19.7
## 2 alaska 1857 0.857
## 3 american samoa 0 0
## 4 arizona 14972. 60.3
## 5 arkansas 7454. 16.6
## 6 california 107012. 89.7
The next step is to compute state populations, which we will use to sort our results by cases per capita, as in the Times’ table.
# Compute state populations
state_pop <-
county_pop |>
group_by(region) |>
summarize(pop = sum(population)) |>
ungroup()
# Sort by cases per capita
state_daily_cases1 <-
state_daily_cases |>
left_join(state_pop, by = "region") |>
mutate(cases_pp = cases / pop * 1e5, death_pp = deaths / pop * 1e5) |>
arrange(desc(cases_pp)) |>
select(region, cases, deaths) |>
slice(1:10)
head(as_tibble(state_daily_cases1))
## # A tibble: 6 × 3
## region cases deaths
## <chr> <dbl> <dbl>
## 1 rhode island 5349. 6.29
## 2 new york 70655. 165.
## 3 massachusetts 23793. 53
## 4 new jersey 29097 75.4
## 5 delaware 3004. 13
## 6 florida 65551. 39.6
Finally, we compute average results across the United States and place them at the top of the table, shown below. The results are surprisingly good – we match the exact numbers for several states and are decently close for the United States as a whole. Though we cannot reproduce the table exactly, their findings do appear to be mostly reproducible.
state_daily_cases <- as_tibble(state_daily_cases)
# Compute US average results
us_daily <-
tibble(
region = "United States",
cases = sum(state_daily_cases$cases),
deaths = sum(state_daily_cases$deaths)
)
# Beautify final table
final_table <-
bind_rows(us_daily, as_tibble(state_daily_cases1)) |>
transmute(
Region = str_to_title(region),
`Daily Avg. Cases` = round(cases, 0),
`Daily Avg. Deaths` = round(deaths, 1)
)
final_table
## # A tibble: 11 × 3
## Region `Daily Avg. Cases` `Daily Avg. Deaths`
## <chr> <dbl> <dbl>
## 1 United States 786478 1825.
## 2 Rhode Island 5349 6.3
## 3 New York 70655 165.
## 4 Massachusetts 23793 53
## 5 New Jersey 29097 75.4
## 6 Delaware 3004 13
## 7 Florida 65551 39.6
## 8 Vermont 1746 1.1
## 9 Utah 8939 11.6
## 10 Hawaii 3868 2.3
## 11 California 107012 89.7
Our final task is to comment on the reproducibility of the four figures taken together. In Task (1), we were able to match the overall shape of the time series very well, even if ours seemed a little more jagged in places. In Task (2), however, our estimate of the percent change in daily deaths was very different from what the Times presented, and it is difficult to say exactly how they arrived at their number. This is all the more puzzling because our estimates for COVID cases, as opposed to deaths, matched theirs surprisingly well. Next, in Task (3), our map of COVID cases across the United States looked similar to theirs in some key locations, but was difficult to compare fully with their figure because of its unique color scheme. It seems reasonable that their results could be correct, but it is hard to say for sure without access to the code they used to generate the figure. Finally, in Task (4), our estimates of the daily case and death rates for the US as a whole and for specific states matched the New York Times results very closely, and our top 10 states in average daily cases per capita were the same as theirs.
In sum, the figures presented by the New York Times are somewhat reproducible, but not as reproducible as they could and should be. Without additional information such as code or statistical details, it is very difficult to obtain results that match their exact numbers. In today’s era of open-source data science, the work presented by the New York Times does not meet the appropriate standards of reproducibility, a shortcoming that is all the more critical when presenting data related to public health.